
Welcome back to deep learning! So today, I want to continue talking to you about known operators, and in particular, I want to show you how to embed these known operations into the network and what kind of theoretical implications this creates.

So the key phrase will be let's not reinvent the wheel.

So we go back all the way to our universal approximation theorem. The universal approximation theorem told us that we can find a one-hidden-layer representation that approximates any continuous function with an approximation U(x), where U(x) is supposed to be very close to the original function and is computed as a superposition, i.e., a linear combination, of sigmoid functions.

We know that there is a bound ε_U, and ε_U tells us the maximum difference between the original function and the approximated function. This approximation is exactly a one-hidden-layer neural network.
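As a quick reminder, the statement can be sketched as follows; the hidden-layer weights w_j, biases b_j, and output weights u_j are the usual ingredients of the theorem and are only written out here for reference.

```latex
% Universal approximation (sketch): a one-hidden-layer network U(x) built
% from sigmoid functions sigma approximates f(x) up to the bound epsilon_U.
U(\vec{x}) = \sum_{j=1}^{N} u_j \,\sigma\!\left(\vec{w}_j^{\top}\vec{x} + b_j\right),
\qquad
\bigl| f(\vec{x}) - U(\vec{x}) \bigr| \le \epsilon_U .
```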

Well, this is nice, but we are not really interested in one-hidden-layer neural networks, right?

We would be interested in an approach that we coined precision learning. So here the idea is that we want to mix approximators with known operations and embed them into the network.

And specifically, the configuration that I have here is a little too big for theoretical analysis. So let's go to a slightly simpler problem, and here we just say, okay, we have a two-layer network where we have a transform of x using u.

So this is a vector-to-vector transform, which is why it is written in boldface, and then we have some transform g, where g takes the output of u and produces a scalar value. This is then essentially the definition of f(x).

So here we know that f(x) is composed of two different functions.
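In symbols, this two-layer setup can be written as follows; the input and intermediate dimensions n and m are my own placeholder names.

```latex
% Two-layer composition: a vector-valued transform u followed by a
% scalar-valued transform g.
\vec{u}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m},
\qquad
g: \mathbb{R}^{m} \rightarrow \mathbb{R},
\qquad
f(\vec{x}) = g\!\left(\vec{u}(\vec{x})\right).
```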

So this is already the first postulate here that we need in order to look into known operator learning.

So we now want to approximate such sequences, and if we look at f, we can see that there are essentially three choices for how we can approximate it.

We can approximate only u, which would give us f_U. We could approximate only g, which would give us f_G. Or we could approximate both of them, which gives G(U(x)), where both are approximations.

Now, with any of these approximations, I am introducing an error. The error can be described as e_u if I approximate u, as e_g if I approximate g, and as e_f if I approximate both.
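Written out, and using capital letters for the approximations as in the lecture, the three variants and the error terms used in the derivation below read roughly as follows; the component-wise notation e_{u,j} is my reading of the slides.

```latex
% Three ways to approximate f(x) = g(u(x)), and the associated errors.
f_U(\vec{x}) = g\!\left(\vec{U}(\vec{x})\right),\quad
f_G(\vec{x}) = G\!\left(\vec{u}(\vec{x})\right),\quad
F(\vec{x}) = G\!\left(\vec{U}(\vec{x})\right),
\qquad\text{with}\qquad
e_{u,j} = u_j(\vec{x}) - U_j(\vec{x}),\quad
e_g = g\!\left(\vec{u}(\vec{x})\right) - G\!\left(\vec{u}(\vec{x})\right),\quad
e_f = f(\vec{x}) - F(\vec{x}).
```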

So let's look into the math and see what we can do with those definitions.

Well, of course, we can start with f(x): we use the definition of f(x), and the definition gives us g(u(x)).

We can start approximating g. Now, if we approximate g, we make some error, so this error has to be added back.

This is then shown in the next line: we can also use the definition of G, which is a linear combination of sigmoid functions.

And here, we then use the original function u_j component-wise, because u is a vector-valued function.

And of course, we have the different weights g_j, the bias g_0, and the error e_g that we introduced.

So we can now also approximate u component-wise; then we introduce an approximation, and this approximation of course also introduces an error.
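Putting these steps together, the chain of substitutions looks roughly like this (my transcription of the slide, in the notation introduced above):

```latex
% Expand f, add back the error of approximating g, expand G into a linear
% combination of sigmoids, then replace each u_j by U_j plus its error.
f(\vec{x}) = g\!\left(\vec{u}(\vec{x})\right)
           = G\!\left(\vec{u}(\vec{x})\right) + e_g
           = \sum_{j} g_j \,\sigma\!\left(u_j(\vec{x})\right) + g_0 + e_g
           = \sum_{j} g_j \,\sigma\!\left(U_j(\vec{x}) + e_{u,j}\right) + g_0 + e_g .
```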

So this is nice, but we kind of get stuck here, because the error of the approximation of u sits inside of the sigmoid function, while all the other errors are outside.

So what can we do about this? Well at least we can look into error bounds.

So let's have a look at error bounds. The key idea here is that we use the property that the sigmoid function has a Lipschitz bound.

So there is a maximum slope that occurs in this function, and it is denoted by l_s. This means that if I am at position x and I move in a direction e, then I can always find an upper bound by taking the magnitude of e times the highest slope that occurs in the function, plus the original function value.

So it is a linear extrapolation, and you can see this in the animation here: we essentially have two white cones that always stay above or below the function.

Obviously we can also construct a lower bound using the Lipschitz property.
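In formulas, with l_s denoting the maximum slope (Lipschitz constant) of the sigmoid, the upper and the corresponding lower bound read:

```latex
% Lipschitz upper and lower bounds for the sigmoid with maximum slope l_s.
\sigma(x + e) \le \sigma(x) + |e|\, l_s ,
\qquad
\sigma(x + e) \ge \sigma(x) - |e|\, l_s .
```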

Well now what can we do with this?

We can now go ahead and use it for our purposes, but we just run into the next problem: our Lipschitz bound here doesn't hold for linear combinations.

So you see that we are actually interested in multiplying this with some weight g_j, and as soon as I take a negative g_j, this would essentially mean that our inequality flips.

So this is not cool, but we can find an alternative formulation like the bottom one.

So we simply have to use the absolute value of the weight when we multiply with the Lipschitz term, in order to remain above the function at all times.
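Concretely, the modified inequality that also works for negative weights is, as far as I can reconstruct it from the slide:

```latex
% Taking the absolute value of the weight g_j keeps the right-hand side an
% upper bound even when g_j is negative (the plain inequality would flip).
g_j \,\sigma(x + e) \le g_j \,\sigma(x) + |g_j| \, l_s \, |e| .
```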

And running through the proof here is kind of tedious, which is why I brought you the two images here. So we reformulated this, took all the terms from the right-hand side, subtracted them, and moved them to the left-hand side, which means that all of these terms in combination need to be lower than zero.

And if you do that for positive and negative g_j, you can see in the two plots that, independent of the choice of e and x, we always stay below zero.

And you can also go to the original reference if you're interested in the formal proof for this.

So now let's use this inequality. We can see that we can finally get our e_{u,j} out of the bracket, out of the sigmoid function, and we obtain an upper bound by using this kind of approximation.

And then we can see, if we arrange the terms correctly, that the first couple of terms are simply the definition of F(x), i.e., the approximation using G and U, and this can then be simplified by just writing F(x).

On top of this, we have the sum over the components of |g_j| times the Lipschitz constant times the absolute value of the error, plus the error that we introduced by approximating g.

Now we can essentially subtract F(x), and if we do so, we can see that f(x) minus F(x) is nothing else than the error introduced when doing this approximation.

So this is simply e_f. We thus have an upper bound for e_f, composed of the sum on the right-hand side, and we can still replace e_g by ε_g, which is the upper bound to e_g; the result is then still an upper bound for e_f.
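In formulas, this step looks roughly like this (again my transcription of the argument, in the notation from above):

```latex
% Pull e_{u,j} out of the sigmoid with the weighted Lipschitz inequality and
% recognize F(x) = G(U(x)) in the remaining terms.
f(\vec{x}) \le \sum_{j} g_j \,\sigma\!\left(U_j(\vec{x})\right) + g_0
             + \sum_{j} |g_j| \, l_s \, |e_{u,j}| + e_g
           = F(\vec{x}) + \sum_{j} |g_j| \, l_s \, |e_{u,j}| + e_g ,
\qquad\Rightarrow\qquad
e_f \le \sum_{j} |g_j| \, l_s \, |e_{u,j}| + \epsilon_g .
```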

Now, these are all upper bounds, but the same idea can also be used to get a lower bound, and you see that we then have this negative sum, which is always a lower bound.

Now, if we have the upper and the lower bound, then we can see that the magnitude of e_f is bounded by the sum over the components of |g_j| times the Lipschitz constant times the error magnitude, plus ε_g.
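So the resulting two-sided bound, as stated in the lecture, can be written as:

```latex
% Combining the upper and the lower bound into a bound on the magnitude of e_f.
|e_f| \le \sum_{j} |g_j| \, l_s \, |e_{u,j}| + \epsilon_g .
```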

This is interesting, because here we see that this is essentially the error of U(x), amplified by the structure of g, plus the error introduced by g.

So if we know u, the error e_u cancels out; if we know g, the error ε_g cancels out; and of course, if we know both, there is no error, because there is nothing left that we have to learn.
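To make this concrete, here is a small numerical sanity check of the two-layer bound. The concrete choices of u, its approximation U, the weights g_j, and the "true" g are entirely made up for illustration; the only things taken from the lecture are the form of the bound |e_f| ≤ Σ_j |g_j| l_s |e_{u,j}| + ε_g and the fact that the logistic sigmoid has maximum slope l_s = 1/4.

```python
import numpy as np

# Logistic sigmoid; its maximum slope (Lipschitz constant) is 1/4.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

L_S = 0.25

# --- Toy setup, purely illustrative ----------------------------------------
def u(x):                                   # "true" first stage, R -> R^2
    return np.stack([np.sin(x), 0.5 * x], axis=-1)

def U(x):                                   # imperfect approximation of u
    return np.stack([np.sin(x) + 0.05 * np.cos(3.0 * x),
                     0.5 * x - 0.03], axis=-1)

g_w = np.array([1.5, -2.0])                 # weights g_j of the approximation G
g_0 = 0.3                                   # bias g_0

def G(y):                                   # linear combination of sigmoids
    return sigmoid(y) @ g_w + g_0

def g(y):                                   # "true" second stage, close to G
    return G(y) + 0.02 * np.sin(y[..., 0] * y[..., 1])

# --- Check the bound on a grid of inputs ------------------------------------
x = np.linspace(-3.0, 3.0, 2001)
e_f = np.abs(g(u(x)) - G(U(x)))             # |f(x) - F(x)|
e_u = np.abs(u(x) - U(x))                   # |e_{u,j}|, one column per component
eps_g = np.max(np.abs(g(u(x)) - G(u(x))))   # empirical epsilon_g on this grid

bound = e_u @ (np.abs(g_w) * L_S) + eps_g   # sum_j |g_j| * l_s * |e_{u,j}| + eps_g
assert np.all(e_f <= bound + 1e-12)
print(f"max |e_f| = {e_f.max():.4f}, max bound = {bound.max():.4f}")
```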

So we can see that this bound has very nice properties, and if we now relate this to classical pattern recognition, then we could interpret u(x) as a feature extractor and g(x) as a classifier.

So you see that if we make errors in u(x), they potentially get amplified by g(x), and this also gives us a hint why, in classical pattern recognition, there was such a high focus on feature extraction.

Any feature that you don't extract correctly is simply missing, and this is also a big advantage of our deep learning approaches: we can optimize the feature extraction with respect to the classification.

Note that when deriving all of this we required Lipschitz continuity.

Okay, now you may say this is only for two layers, but we also extended this to deep networks. So you can actually do this: once you have the two-layer constellation, you can find a proof by recursion showing that there is also a bound for deep networks; you then essentially get a sum over the layers to find this upper bound.
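Only as a rough sketch of the shape of that result (the exact constants are worked out in the original reference): each layer contributes one error term, amplified by the layers that follow it, and these contributions add up.

```latex
% Schematic multi-layer bound: a sum over the layers l = 1..L, where C_l
% collects the weights and Lipschitz constants of the layers after layer l
% and epsilon_l is the approximation error introduced at layer l.
|e_f| \;\le\; \sum_{l=1}^{L} C_l \,\epsilon_l .
```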


Deep Learning - Known Operator Learning Part 2

In this video, we show the theoretical foundations of known operator learning and how to derive maximal error bounds.


